machine learning pipeline
Rethinking Privacy in Machine Learning Pipelines from an Information Flow Control Perspective
Wutschitz, Lukas, Köpf, Boris, Paverd, Andrew, Rajmohan, Saravan, Salem, Ahmed, Tople, Shruti, Zanella-Béguelin, Santiago, Xia, Menglin, Rühle, Victor
Modern machine learning systems use models trained on ever-growing corpora. Typically, metadata such as ownership, access control, or licensing information is ignored during training. Instead, to mitigate privacy risks, we rely on generic techniques such as dataset sanitization and differentially private model training, with inherent privacy/utility trade-offs that hurt model performance. Moreover, these techniques have limitations in scenarios where sensitive information is shared across multiple participants and fine-grained access control is required. By ignoring metadata, we therefore miss an opportunity to better address security, privacy, and confidentiality challenges. In this paper, we take an information flow control perspective to describe machine learning systems, which allows us to leverage metadata such as access control policies and define clear-cut privacy and confidentiality guarantees with interpretable information flows. Under this perspective, we contrast two different approaches to achieve user-level non-interference: 1) fine-tuning per-user models, and 2) retrieval augmented models that access user-specific datasets at inference time. We compare these two approaches to a trivially non-interfering zero-shot baseline using a public model and to a baseline that fine-tunes this model on the whole corpus. We evaluate trained models on two datasets of scientific articles and demonstrate that retrieval augmented architectures deliver the best utility, scalability, and flexibility while satisfying strict non-interference guarantees.
- North America > United States (0.14)
- Asia > Middle East (0.14)
- Information Technology > Security & Privacy (1.00)
- Energy > Oil & Gas > Upstream (0.62)
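The retrieval-augmented design contrasted in the abstract above can be sketched minimally: retrieval filters the corpus by access-control metadata *before* any model sees the data, so a user's prompt can only ever be augmented with documents their policy permits, and non-interference holds by construction. All names and data below are hypothetical illustrations, not the paper's implementation:

```python
# Minimal sketch of access-control-aware retrieval augmentation.
# `Document`, `retrieve`, and the corpus are illustrative assumptions,
# not the paper's actual system.
from dataclasses import dataclass

@dataclass
class Document:
    text: str
    allowed_users: frozenset  # access-control metadata kept with the data

def retrieve(query: str, user: str, corpus: list) -> list:
    """Return documents matching `query` that `user` is allowed to read.

    The access-control filter runs before any matching or augmentation,
    so information never flows across policy boundaries.
    """
    readable = [d for d in corpus if user in d.allowed_users]
    return [d for d in readable if query.lower() in d.text.lower()]

corpus = [
    Document("quarterly sales figures", frozenset({"alice"})),
    Document("public sales brochure", frozenset({"alice", "bob"})),
]
# Bob only ever sees documents his policy permits:
# retrieve("sales", "bob", corpus) returns just the public brochure.
```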
Text2Struct: A Machine Learning Pipeline for Mining Structured Data from Text
Many analysis and prediction tasks require the extraction of structured data from unstructured texts. However, no annotation scheme or training dataset has been available for training machine learning models to mine structured data from text without special templates and patterns. To address this, this paper presents an end-to-end machine learning pipeline, Text2Struct, comprising a text annotation scheme, training data processing, and a machine learning implementation. We formulated the mining problem as the extraction of metrics and units associated with numerals in the text. Text2Struct was trained and evaluated on an annotated text dataset collected from abstracts of medical publications regarding thrombectomy. In terms of prediction performance, a Dice coefficient of 0.82 was achieved on the test dataset. Random sampling showed that most predicted relations between numerals and entities matched the ground-truth annotations well. These results show that Text2Struct is viable for mining structured data from text without special templates or patterns. We anticipate further improving the pipeline by expanding the dataset and investigating other machine learning models. A code demonstration can be found at: https://github.com/zcc861007/CourseProject
- Europe > Italy (0.04)
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- Europe > France (0.04)
- Research Report > New Finding (0.88)
- Research Report > Experimental Study (0.68)
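The extraction task and the evaluation metric described in the abstract above can be sketched in a few lines. The regex, function names, and example sentence are illustrative assumptions, not part of the Text2Struct pipeline:

```python
# Toy sketch of the numeral/unit extraction task Text2Struct addresses,
# plus the Dice coefficient used to score predictions against annotations.
import re

def extract_numeral_units(text: str):
    """Find (numeral, unit) pairs such as '45 minutes' or '2.5 mg'."""
    return re.findall(r"(\d+(?:\.\d+)?)\s*([A-Za-z%]+)", text)

def dice(pred: set, truth: set) -> float:
    """Dice coefficient: 2|P ∩ T| / (|P| + |T|)."""
    if not pred and not truth:
        return 1.0
    return 2 * len(pred & truth) / (len(pred) + len(truth))

pairs = extract_numeral_units("Recanalization in 45 minutes with 2.5 mg dose")
# pairs == [("45", "minutes"), ("2.5", "mg")]
```

A real pipeline would replace the regex with a learned model, but the pair-extraction framing and the set-overlap evaluation carry over unchanged.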
Pyrocast: a Machine Learning Pipeline to Forecast Pyrocumulonimbus (PyroCb) Clouds
Tazi, Kenza, Salas-Porras, Emiliano Díaz, Braude, Ashwin, Okoh, Daniel, Lamb, Kara D., Watson-Parris, Duncan, Harder, Paula, Meinert, Nis
Pyrocumulonimbus (pyroCb) clouds are storm clouds generated by extreme wildfires. PyroCbs are associated with unpredictable, and therefore dangerous, wildfire spread. They can also inject smoke particles and trace gases into the upper troposphere and lower stratosphere, affecting the Earth's climate. As global temperatures increase, these previously rare events are becoming more common. Being able to predict which fires are likely to generate pyroCb is therefore key to climate adaptation in wildfire-prone areas. This paper introduces Pyrocast, a pipeline for pyroCb analysis and forecasting. The pipeline's first two components, a pyroCb database and a pyroCb forecast model, are presented. The database brings together geostationary imagery and environmental data for over 148 pyroCb events across North America, Australia, and Russia between 2018 and 2022. Random Forests, Convolutional Neural Networks (CNNs), and CNNs pretrained with Auto-Encoders were tested to predict the generation of pyroCb for a given fire six hours in advance. The best model predicted pyroCb with an AUC of $0.90 \pm 0.04$.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.28)
- Oceania > Australia (0.26)
- Europe > Russia (0.25)
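The AUC reported above has a simple rank-based reading: it is the probability that a randomly chosen pyroCb-producing fire is scored higher than a randomly chosen non-producing one. A self-contained sketch with made-up scores (not Pyrocast outputs):

```python
# Rank-based (Mann-Whitney) computation of AUC, the metric the paper
# reports for its pyroCb forecast models. Scores are toy values.
def auc(scores_pos, scores_neg):
    """Fraction of positive/negative pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Fires that produced a pyroCb (positives) vs. fires that did not:
print(auc([0.9, 0.8, 0.7], [0.6, 0.4]))  # 1.0: every positive outranks every negative
```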
Update Your Machine Learning Pipeline With vetiver and Quarto
Machine learning operations (MLOps) are a set of best practices for running machine learning models successfully in production environments. Data scientists and system administrators have expanding options for setting up their pipeline. However, while many tools exist for preparing data and training models, there is a lack of streamlined tooling for tasks like putting a model in production, maintaining the model, or monitoring performance. Enter vetiver, an open-source framework for the entire model lifecycle. Vetiver provides R and Python programmers with a fluid, unified way of working with machine learning models.
Machine Learning Pipelines
In this use case, we will work with the Titanic dataset. We will apply some common Transformers to certain columns and then use a Decision Tree Estimator to classify whether a passenger survived. Here is the plan outline for our use case. To make it easy to follow, see the diagram below, which gives a fairly good visual understanding of the pipeline.
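The Transformer/Estimator pattern this use case relies on can be sketched in plain Python. The classes below only mirror the concepts — they are not Spark ML's actual API, and the "model" is a deliberately toy threshold rule:

```python
# Minimal sketch of the Transformer -> Estimator pipeline pattern used
# in the Titanic use case. Names and logic are illustrative only.
class FillMissingAge:
    """Transformer: replace missing ages with a default value."""
    def __init__(self, default=30):
        self.default = default
    def transform(self, rows):
        return [{**r, "age": r["age"] if r["age"] is not None else self.default}
                for r in rows]

class ThresholdClassifier:
    """Toy 'estimator': learns a single age threshold from labeled rows."""
    def fit(self, rows):
        survivors = [r["age"] for r in rows if r["survived"]]
        self.threshold = sum(survivors) / len(survivors)
        return self
    def predict(self, row):
        return row["age"] <= self.threshold

rows = [
    {"age": 8, "survived": True},
    {"age": None, "survived": False},
    {"age": 60, "survived": False},
]
prepared = FillMissingAge().transform(rows)   # transform step
model = ThresholdClassifier().fit(prepared)   # fit step
```

In Spark ML the same shape appears as a `Pipeline` of Transformer stages followed by an Estimator, fitted in one call; the point here is only the data-in, transformed-data-out, fitted-model-out flow.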
Building Deep Learning Pipelines with TensorFlow Extended
You can check the code for this tutorial here. Once you finish your model experimentation, it is time to roll things to production. Rolling machine learning to production is not just a question of wrapping the model binaries with a REST API and starting to serve it, but also of making it possible to re-create (or update) and re-deploy your model. That means the steps from preprocessing the data to training the model to rolling it to production (we call this a machine learning pipeline) should be deployable and runnable as easily as possible, while remaining trackable and parameterizable (to use different data, for example). In this post, we will see how to build a machine learning pipeline for a deep learning model using TensorFlow Extended (TFX), how to run and deploy it to Google Vertex AI, and why we should use it.
Building A Machine Learning Pipeline Using Pyspark - Analytics Vidhya
This article was published as a part of the Data Science Blogathon. Spark is an open-source framework for big data processing. It was originally written in Scala; later, due to increasing demand for machine learning on big data, a Python API was released. PySpark, then, is the Python API for Spark. PySpark works effectively with Spark components such as Spark SQL, MLlib, and Streaming, letting us leverage the true potential of big data and machine learning.
Deploy Machine Learning Pipeline on Google Kubernetes Engine
In our last post on deploying a machine learning pipeline in the cloud, we demonstrated how to develop a machine learning pipeline in PyCaret, containerize it with Docker, and serve it as a web app using Microsoft Azure Web App Services. If you haven't heard about PyCaret before, please read this announcement to learn more. In this tutorial, we will use the same machine learning pipeline and Flask app that we built and deployed previously. This time, we will demonstrate how to containerize and deploy a machine learning pipeline on Google Kubernetes Engine. Previously, we demonstrated how to deploy an ML pipeline on Heroku PaaS and on Azure Web Services with a Docker container.